[spark] Automatically shut down ray on spark cluster if user does not execute commands on databricks notebook for a long time #31962

Merged

WeichenXu123
Contributor

@WeichenXu123 WeichenXu123 commented Jan 26, 2023

Signed-off-by: Weichen Xu [email protected]

Why are these changes needed?

Automatically shut down the Ray on Spark cluster if the user has not executed any commands in the Databricks notebook for a long time.

Databricks Runtime provides an API, dbutils.entry_point.getIdleTimeMillisSinceLastNotebookExecution(), that returns the number of milliseconds elapsed since the last code execution in the Databricks notebook.
This PR calls that API to monitor notebook activity and shuts down the Ray cluster once the idle time exceeds a timeout.
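
The monitoring idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `start_idle_monitor`, its parameters, and the polling interval are all hypothetical, and the idle-time source is passed in as a callable so the sketch also runs outside Databricks.

```python
import threading
import time
from typing import Callable


def start_idle_monitor(
    get_idle_ms: Callable[[], float],
    shutdown: Callable[[], None],
    timeout_ms: float,
    poll_interval_s: float = 3.0,
) -> threading.Thread:
    """Poll the idle time and call shutdown() once it exceeds timeout_ms.

    All names here are hypothetical; they only illustrate the approach.
    """

    def _loop() -> None:
        while True:
            if get_idle_ms() > timeout_ms:
                shutdown()
                return
            time.sleep(poll_interval_s)

    # A daemon thread does not keep the Python process alive on its own.
    thread = threading.Thread(target=_loop, daemon=True)
    thread.start()
    return thread


# On Databricks, the idle-time source could be wired up like this
# (dbutils only exists inside a Databricks notebook):
#   get_idle_ms = lambda: (
#       dbutils.entry_point.getIdleTimeMillisSinceLastNotebookExecution()
#   )
```

Passing the idle-time source and the shutdown action as callables keeps the loop testable with fakes, which matters because `dbutils` is unavailable in a plain Python environment.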

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Weichen Xu <[email protected]>
Collaborator

@jjyao jjyao left a comment

What about this case?

The notebook cell is:

ray.init()
result = long_running_task.remote()

This cell finishes running immediately, and the next cell is executed more than 30 minutes later:

ray.get(result)

Will the Ray cluster get terminated between these two cells?

@WeichenXu123
Contributor Author

In this case, the notebook status will be running (blocked on the ray.get(result) command), so getIdleTimeMillisSinceLastNotebookExecution will always return 0 while the remote task is executing.
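
To make the semantics in this reply concrete, here is a toy model. It is only an assumption-based reading of the reply, not the Databricks implementation: the class `ToyNotebook` and its methods are hypothetical, and it assumes the idle time is 0 while a cell is executing and is otherwise the time since the last cell finished.

```python
class ToyNotebook:
    """Toy model of the idle-time semantics described in the reply (assumed)."""

    def __init__(self) -> None:
        self.now_ms = 0
        self.running = False
        self.last_finished_ms = 0

    def advance(self, ms: int) -> None:
        """Move the simulated clock forward."""
        self.now_ms += ms

    def start_cell(self) -> None:
        self.running = True

    def finish_cell(self) -> None:
        self.running = False
        self.last_finished_ms = self.now_ms

    def idle_ms(self) -> int:
        # While a cell is blocked (e.g. on ray.get), report no idle time.
        if self.running:
            return 0
        return self.now_ms - self.last_finished_ms


nb = ToyNotebook()
nb.start_cell()               # a cell blocks on ray.get(result)
nb.advance(45 * 60 * 1000)    # 45 simulated minutes pass while it runs
idle_while_running = nb.idle_ms()
nb.finish_cell()
nb.advance(10 * 60 * 1000)    # 10 simulated minutes with no cell running
idle_after_finish = nb.idle_ms()
```

Under this toy model, the idle time stays at 0 for as long as a cell is executing and only starts growing once no cell is running.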

@jjyao
Collaborator

jjyao commented Jan 31, 2023

There are related test failures:

```
python/ray/tests/spark/test_databricks_hook.py:3:1: F401 'mock' imported but unused
```

@WeichenXu123
Contributor Author

Addressed.

@jjyao jjyao merged commit 3a1709f into ray-project:master Jan 31, 2023
@harupy

harupy commented Feb 1, 2023

@WeichenXu123 Can you add test cases for the following cases?

  • DATABRICKS_RAY_ON_SPARK_AUTOSHUTDOWN_MINUTES is not a number
  • DATABRICKS_RAY_ON_SPARK_AUTOSHUTDOWN_MINUTES is a negative number
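
The two cases above could be exercised against a validation helper along these lines. This is a sketch under stated assumptions: only the environment variable name comes from this thread, while the function `parse_autoshutdown_minutes`, the default of 30 minutes, and the choice to raise `ValueError` are hypothetical and may differ from what the PR actually does.

```python
import math
import os
from typing import Mapping, Optional

ENV_VAR = "DATABRICKS_RAY_ON_SPARK_AUTOSHUTDOWN_MINUTES"


def parse_autoshutdown_minutes(
    environ: Optional[Mapping[str, str]] = None,
    default: float = 30.0,  # assumed default, not confirmed by this thread
) -> float:
    """Read and validate the auto-shutdown timeout (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    raw = environ.get(ENV_VAR)
    if raw is None:
        return default
    try:
        minutes = float(raw)
    except ValueError:
        raise ValueError(f"{ENV_VAR} must be a number, got {raw!r}") from None
    # float() accepts "nan", so reject it explicitly along with negatives.
    if math.isnan(minutes) or minutes < 0:
        raise ValueError(f"{ENV_VAR} must be non-negative, got {raw!r}")
    return minutes
```

Taking the environment as an optional mapping lets a test pass a plain dict instead of patching `os.environ`.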

@WeichenXu123
Contributor Author

Filed a follow-up PR #32162

jjyao pushed a commit that referenced this pull request Feb 2, 2023
…ng messages (#32162)

See follow-up comments in #31962

Signed-off-by: Weichen Xu <[email protected]>